This is Part C of APAN5205 Group 6’s final report.
This part is the same as in the final report so we hide the code and output.
## 'data.frame': 42656 obs. of 6 variables:
## $ Review_ID : int 670772142 670682799 670623270 670607911 670607296 670591897 670585330 670574142 670571027 670570869 ...
## $ Rating : int 4 4 4 4 4 3 5 3 2 5 ...
## $ Year_Month : chr "2019-4" "2019-5" "2019-4" "2019-4" ...
## $ Reviewer_Location: chr "Australia" "Philippines" "United Arab Emirates" "Australia" ...
## $ Review_Text : chr "If you've ever been to Disneyland anywhere you'll find Disneyland Hong Kong very similar in the layout when you"| __truncated__ "Its been a while since d last time we visit HK Disneyland .. Yet, this time we only stay in Tomorrowland .. AKA"| __truncated__ "Thanks God it wasn t too hot or too humid when I was visiting the park otherwise it would be a big issue (t"| __truncated__ "HK Disneyland is a great compact park. Unfortunately there is quite a bit of maintenance work going on at prese"| __truncated__ ...
## $ Branch : chr "Disneyland_HongKong" "Disneyland_HongKong" "Disneyland_HongKong" "Disneyland_HongKong" ...
## Rows: 42,656
## Columns: 6
## $ Review_ID <int> 670772142, 670682799, 670623270, 670607911, 67060729…
## $ Rating <int> 4, 4, 4, 4, 4, 3, 5, 3, 2, 5, 5, 5, 4, 5, 5, 3, 4, 3…
## $ Year_Month <chr> "2019-4", "2019-5", "2019-4", "2019-4", "2019-4", "2…
## $ Reviewer_Location <chr> "Australia", "Philippines", "United Arab Emirates", …
## $ Review_Text <chr> "If you've ever been to Disneyland anywhere you'll f…
## $ Branch <chr> "Disneyland_HongKong", "Disneyland_HongKong", "Disne…
## integer(0)
## [1] 2613 6
## [1] 40043 6
## [1] 4 3 5 2 1
## average_rating median_rating
## 1 4.231102 5
##
## 1 2 3 4 5
## 1338 1929 4782 10086 21908
##
## 1 2 3 4 5
## 0.03341408 0.04817321 0.11942162 0.25187923 0.54711185
##
## negative neutral positive
## 3267 4782 31994
## [1] 162
##
## Afghanistan Albania
## 2 6
## Algeria Andorra
## 2 1
## Antigua and Barbuda Argentina
## 1 25
## Armenia Aruba
## 1 2
## Australia Austria
## 4412 27
## Azerbaijan Bahrain
## 2 39
## Bangladesh Barbados
## 12 5
## Belgium Bolivia
## 132 3
## Bosnia and Herzegovina Botswana
## 7 3
## Brazil Brunei
## 94 18
## Bulgaria Cambodia
## 16 7
## Canada Caribbean Netherlands
## 2116 1
## Cayman Islands Chile
## 1 18
## China Colombia
## 167 11
## Cook Islands Costa Rica
## 2 9
## Croatia Cuba
## 16 1
## Curacao Cyprus
## 1 45
## Czechia Democratic Republic of the Congo
## 27 1
## Denmark Dominican Republic
## 82 4
## Ecuador Egypt
## 3 75
## El Salvador Estonia
## 1 9
## Ethiopia Falkland Islands (Islas Malvinas)
## 3 2
## Fiji Finland
## 5 60
## Five Islands France
## 1 223
## French Polynesia Georgia
## 3 2
## Germany Ghana
## 182 2
## Gibraltar Greece
## 8 101
## Grenada Guam
## 1 16
## Guatemala Guernsey
## 8 8
## Haiti Honduras
## 2 2
## Hong Kong Hungary
## 515 23
## Iceland India
## 5 1470
## Indonesia Iran
## 511 26
## Iraq Ireland
## 1 456
## Isle of Man Israel
## 8 113
## Italy Ivory Coast
## 117 2
## Jamaica Japan
## 2 61
## Jersey Jordan
## 14 8
## Kazakhstan Kenya
## 7 16
## Kuwait Laos
## 43 2
## Latvia Lebanon
## 5 56
## Libya Lithuania
## 2 5
## Luxembourg Macau
## 12 35
## Madagascar Malawi
## 1 2
## Malaysia Maldives
## 562 4
## Mali Malta
## 2 80
## Mauritius Mexico
## 27 116
## Moldova Monaco
## 4 2
## Mongolia Montenegro
## 3 4
## Morocco Mozambique
## 4 3
## Myanmar (Burma) Namibia
## 7 1
## Nepal Netherlands
## 6 239
## New Zealand Nicaragua
## 714 1
## Nigeria North Macedonia
## 23 6
## Northern Mariana Islands Norway
## 2 98
## Oman Pakistan
## 23 25
## Panama Papua New Guinea
## 6 1
## Peru Philippines
## 12 1024
## Poland Portugal
## 25 98
## Puerto Rico Qatar
## 17 63
## Romania Russia
## 93 43
## Rwanda Saudi Arabia
## 3 114
## Senegal Serbia
## 1 11
## Seychelles Singapore
## 4 971
## Slovakia Slovenia
## 9 2
## Solomon Islands South Africa
## 2 233
## South Korea South Sudan
## 36 1
## Spain Sri Lanka
## 132 34
## Sudan Suriname
## 1 1
## Sweden Switzerland
## 94 117
## Taiwan Tanzania
## 34 4
## Thailand The Bahamas
## 216 2
## Timor-Leste Trinidad and Tobago
## 1 7
## Tunisia Turkey
## 4 50
## Turks and Caicos Islands U.S. Virgin Islands
## 1 5
## Uganda Ukraine
## 4 8
## United Arab Emirates United Kingdom
## 339 9115
## United States Uruguay
## 13522 7
## Uzbekistan Vanuatu
## 1 2
## Venezuela Vietnam
## 3 55
## Zambia Zimbabwe
## 3 2
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Five Islands
## 'data.frame': 46 obs. of 28 variables:
## $ Ride_name : chr "Alien Swirling Saucers" "Astro Orbiter" "Avatar Flight of Passage" "Big Thunder Mountain Railroad" ...
## $ Park_location : chr "HS" "MK" "AK" "MK" ...
## $ Park_area : chr "Toy Story Land" "Tomorrowland" "Pandora" "Frontierland" ...
## $ Ride_type_all : chr "spinning" "spinning, slow" "thrill" "thirll, small drops" ...
## $ Ride_type_thrill : chr "No" "No" "Yes" "Yes" ...
## $ Ride_type_spinning : chr "Yes" "Yes" "No" "No" ...
## $ Ride_type_slow : chr "No" "Yes" "No" "No" ...
## $ Ride_type_small_drops : chr "No" "No" "No" "Yes" ...
## $ Ride_type_big_drops : chr "No" "No" "No" "No" ...
## $ Ride_type_dark : chr "No" "No" "No" "No" ...
## $ Ride_type_scary : chr "No" "No" "No" "No" ...
## $ Ride_type_water : chr "No" "No" "No" "No" ...
## $ Fast_pass : chr "Yes" "No" "Yes" "Yes" ...
## $ Classic : chr "No" "Yes" "No" "Yes" ...
## $ Age_interest_all : chr "all ages" "all ages" "kids, tweens, teens, adults" "kids, tweens, teens, adults" ...
## $ Age_interest_preschoolers: chr "Yes" "Yes" "No" "No" ...
## $ Age_interest_kids : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Age_interest_tweens : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Age_interest_teens : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Age_interest_adults : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Height_req_inches : int 32 0 44 40 0 40 0 44 0 0 ...
## $ Ride_duration_min : num 1.5 1.5 5 3.5 4 3.25 1.5 2.75 5 8 ...
## $ Open_date : chr "6/30/18" "2/25/95" "5/27/17" "9/23/80" ...
## $ Age_of_ride_days : num 1712 10238 2111 15506 8918 ...
## $ Age_of_ride_years : num 4.69 28.03 5.78 42.45 24.42 ...
## $ Age_of_ride_total : chr "4 years 8 months 7 days" "28 years 0 months 11 days" "5 years 9 months 11 days" "42 years 5 months 14 days" ...
## $ TL_rank : int 31 43 9 8 32 24 29 1 27 47 ...
## $ TA_Stars : num NA 3.5 5 4.5 4.5 4 4.5 5 4 4 ...
## [1] 4
## Warning in cor(rides_cor): the standard deviation is zero
This part also repeats what we did in Part A, so we again hide the code and output.
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
## Warning in tm_map.SimpleCorpus(corpus1, FUN = content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = removeWords,
## c(stopwords("english"))): transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = stripWhitespace): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = stemDocument): transformation
## drops documents
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
## Selecting by tfidf
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
# Test how to extract reviews that mention a specific ride name
library(dplyr)
library(stringr)
rides_name <- rides$Ride_name
rides_name[1]
## [1] "Astro Orbiter"
disneyland_1 <- disneyland %>%
mutate(Astro_Orbiter = case_when(grepl(c("astro orbiter", "Astro", "Orbiter"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Astro_Orbiter = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
ao <- disneyland_1 %>%
filter(Astro_Orbiter == 1)
rides_name[2]
## [1] "Avatar Flight of Passage"
disneyland_2 <- disneyland %>%
mutate(Avatar_Flight = case_when(grepl(c("Avatar Flight of Passage", "Avatar Flight", "Avatar Ride"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Avatar_Flight = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
af <- disneyland_2 %>%
filter(Avatar_Flight == 1)
rides_name[3]
## [1] "Big Thunder Mountain Railroad"
disneyland_3 <- disneyland %>%
mutate(Big_Thunder = case_when(grepl(c("Big Thunder Mountain Railroad", "Big Thunder"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Big_Thunder = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
bt <- disneyland_3 %>%
filter(Big_Thunder == 1)
We want to detect whether a review mentions a specific ride by name. For
example, if a review mentions “Astro Orbiter” in
Review_Text, we enter 1 in a new column called
Astro_Orbiter; otherwise we enter 0.
We do the same for all 42 rides in the “rides” dataset, so that
we can match ride features with the reviews that mention each ride.
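As a minimal sketch of how several name variants could be matched at once, assuming that a mention of any variant should count: grepl() accepts a single pattern, so the variants can be collapsed into one regular expression with "|". The toy reviews data frame and the ride_variants list below are made up for illustration and are not part of our dataset.

```r
# Toy data standing in for the review dataset (hypothetical rows).
reviews <- data.frame(
  Review_ID   = 1:4,
  Review_Text = c(
    "Astro Orbiter had a short queue.",
    "We loved the orbiter at night!",
    "Big Thunder was closed for maintenance.",
    "Great parade and fireworks."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical variant lists: one entry per indicator column.
ride_variants <- list(
  Astro_Orbiter = c("Astro Orbiter", "Astro", "Orbiter"),
  Big_Thunder   = c("Big Thunder Mountain Railroad", "Big Thunder")
)

# grepl() uses only the first element of a pattern vector, so collapse the
# variants into a single alternation such as "Astro Orbiter|Astro|Orbiter".
for (col in names(ride_variants)) {
  pattern <- paste(ride_variants[[col]], collapse = "|")
  reviews[[col]] <- as.integer(
    grepl(pattern, reviews$Review_Text, ignore.case = TRUE)
  )
}

reviews[, c("Review_ID", "Astro_Orbiter", "Big_Thunder")]
```

Short variants such as “Astro” widen the match and may add false positives, and names containing regular-expression metacharacters would need escaping before being pasted into a pattern.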
library(dplyr)
library(stringr)
rides_name[1:14]
## [1] "Astro Orbiter"
## [2] "Avatar Flight of Passage"
## [3] "Big Thunder Mountain Railroad"
## [4] "Buzz Lightyear's Space Ranger Spin"
## [5] "Dinosaur"
## [6] "Dumbo the Flying Elephant"
## [7] "Expedition Everest"
## [8] "Frozen Ever After"
## [9] "Gran Fiesta Tour Starring The Three Caballeros"
## [10] "Haunted Mansion"
## [11] "It's a Small World"
## [12] "Journey Into Imagination with Figment"
## [13] "Jungle Cruise"
## [14] "Kali River Rapids"
disneyland_ride <- disneyland %>%
mutate(Astro_Orbiter = case_when(grepl(c("astro orbiter", "Astro", "Orbiter"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Avatar_Flight = case_when(grepl(c("Avatar Flight of Passage", "Avatar Flight", "Avatar Ride"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Big_Thunder = case_when(grepl(c("Big Thunder Mountain Railroad", "Big Thunder"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Buzz_Lightyear = case_when(grepl(c("Buzz Lightyear's Space Ranger Spin", "Buzz Lightyear's", "Space Ranger Spin", "Space Ranger"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Dinosaur = case_when(grepl(c("Dinosaur"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Dumbo = case_when(grepl(c("Dumbo the Flying Elephant", "Flying Elephant"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Expedition_Everest = case_when(grepl(c("Expedition Everest"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Frozen_Ever = case_when(grepl(c("Frozen Ever After", "Ever After"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Gran_Fiesta = case_when(grepl(c("Gran Fiesta Tour Starring The Three Caballeros", "Gran Fiesta", "Starring The Three Caballeros", "Three Caballeros"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Haunted_Mansion = case_when(grepl(c("Haunted Mansion"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Small_World= case_when(grepl(c("It's a Small World", "Small World"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Journey_Into= case_when(grepl(c("Journey Into Imagination with Figment", "Imagination with Figment"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Jungle_Cruise= case_when(grepl(c("Jungle Cruise"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Kali_River = case_when(grepl(c("Kali River Rapids", "Kali River"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Astro_Orbiter = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Avatar_Flight = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Big_Thunder = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Buzz_Lightyear = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Dumbo = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Frozen_Ever = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Gran_Fiesta = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Small_World = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Journey_Into = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Kali_River = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
#head(disneyland_ride)
rides_name[15:28]
## [1] "Kilimanjaro Safaris" "Living with the Land"
## [3] "Mad Tea Party" "Mission Space"
## [5] "Na'vi River Journey" "Peter Pan's Flight"
## [7] "Pirates of the Caribbean" "Primeval Whirl"
## [9] "Prince Charming Regal Carrousel" "Rock 'n' Roller Coaster"
## [11] "Seven Dwarfs Mine Train" "Soarin' Around the World"
## [13] "Space Mountain" "Spaceship Earth"
disneyland_ride <- disneyland_ride %>%
mutate(Kilimanjaro_Safaris = case_when(grepl(c("Kilimanjaro Safaris", "Kilimanjaro"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Living_With = case_when(grepl(c("Living with the Land"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Mad_Tea = case_when(grepl(c("Mad Tea Party", "Mad Tea"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Mission_Space = case_when(grepl(c("Mission Space"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Navi_River = case_when(grepl(c("Na'vi River Journey", "Na'vi"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Peter_Pan = case_when(grepl(c("Peter Pan's Flight", "Peter Pan's"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Pirates = case_when(grepl(c("Pirates of the Caribbean", "Pirates", "Caribbean"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Primeval_Whirl = case_when(grepl(c("Primeval Whirl"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Prince_Charming = case_when(grepl(c("Prince Charming Regal Carrousel", "Regal Carrousel"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Rock_Roller = case_when(grepl(c("Rock 'n' Roller Coaster", "Rock n"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Seven_Dwarfs = case_when(grepl(c("Seven Dwarfs Mine Train", "Mine Train"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Soarin_Around = case_when(grepl(c("Soarin' Around the World", "Soaring Around", "Soarin' Around"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Space_Mountain = case_when(grepl(c("Space Mountain"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Spaceship_Earth = case_when(grepl(c("Spaceship Earth"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Kilimanjaro_Safaris = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Mad_Tea = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Navi_River = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Peter_Pan = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Pirates = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Prince_Charming = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Rock_Roller = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Seven_Dwarfs = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Soarin_Around = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
rides_name[29:42]
## [1] "Splash Mountain"
## [2] "Star Tours"
## [3] "Test Track"
## [4] "The Barnstormer"
## [5] "The Magic Carpets of Aladdin"
## [6] "The Many Adventures of Winnie the Pooh"
## [7] "The Twilight Zone Tower of Terror"
## [8] "Tomorrowland Speedway"
## [9] "Tomorrowland Transit Authority PeopleMover"
## [10] "Toy Story Mania"
## [11] "TriceraTop Spin"
## [12] "Under the Sea"
## [13] "Walt Disney World Railroad"
## [14] "Walt Disney's Carousel of Progress"
disneyland_ride <- disneyland_ride %>%
mutate(Splash_Mountain = case_when(grepl(c("Splash Mountain"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Star_Tours = case_when(grepl(c("Star Tours"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Test_Track = case_when(grepl(c("Test Track"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Barnstormer = case_when(grepl(c("The Barnstormer"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Magic_Carpets = case_when(grepl(c("The Magic Carpets of Aladdin", "Magic Carpets"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Winnie_Pooh = case_when(grepl(c("The Many Adventures of Winnie the Pooh", "Many Adventures of Winnie"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Twilight_Zone = case_when(grepl(c("The Twilight Zone Tower of Terror", "Twilight Zone", "Tower of Terror"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Tomorrowland_Speedway = case_when(grepl(c("Tomorrowland Speedway"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Tomorrowland_Transit = case_when(grepl(c("Tomorrowland Transit Authority PeopleMover", "Transit Authority"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Toy_Story = case_when(grepl(c("Toy Story Mania"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(TriceraTop_Spin = case_when(grepl(c("TriceraTop Spin", "Tricera Top"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Under_Sea = case_when(grepl(c("Under the Sea"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(World_Railroad = case_when(grepl(c("Walt Disney World Railroad", "Disney World Railroad"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0)) %>%
mutate(Carousel_Progress = case_when(grepl(c("Walt Disney's Carousel of Progress", "Carousel of Progress"), Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Magic_Carpets = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Winnie_Pooh = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Twilight_Zone = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Tomorrowland_Transit = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `TriceraTop_Spin = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `World_Railroad = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Carousel_Progress = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
We matched ride names against Review_Text manually because
we wanted to include possible variations of each ride name.
For example, for the ride “Seven Dwarfs Mine Train” we listed both the
official name and the likely abbreviation “Mine Train”, matched
case-insensitively. Note, however, that the grepl() warnings above
indicate that only the first element of each pattern vector was used, so
the counts reported below reflect the first listed name only. We also
cannot guarantee that we captured every occurrence: a typo in a
visitor’s review, or another way of referring to the ride, would not be
matched.
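One possible way to catch misspelled ride names, sketched here as an assumption rather than something we applied, is approximate matching with base R’s agrepl(). The review snippets below are hypothetical.

```r
# Hypothetical review snippets; the second one misspells the ride name.
texts <- c(
  "Space Mountain was the highlight of our day.",
  "Space Mountan had a very long queue.",
  "We enjoyed the parade."
)

# Exact, case-insensitive matching misses the misspelled mention.
exact <- grepl("Space Mountain", texts, ignore.case = TRUE)

# agrepl() allows approximate matches; max.distance = 2 permits up to two
# edits (insertions, deletions, or substitutions).
fuzzy <- agrepl("Space Mountain", texts, max.distance = 2, ignore.case = TRUE)

data.frame(texts, exact, fuzzy)
```

A larger max.distance catches more typos but also raises the risk of false matches, so any threshold would need to be checked against a sample of reviews.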
Some ride names also include a Disney character, for example
“Peter Pan’s Flight”, for which we included “Peter Pan’s” as an
abbreviated ride name. We cannot be sure whether reviews using such
words refer to the ride or to the character who interacts with visitors
in Disneyland.
In a future NLP analysis that captures the context of each review, we
may be able to tell whether “Peter Pan” refers to the Disney character
or to the ride. For now, we use the manually matched dataset.
Because some of the rides exist only at the Orlando Disney park, we
exclude the columns for rides that are not mentioned in
any Review_Text.
# Keep only the columns that are not all zero, i.e. drop ride columns that are never mentioned
disneyland_ride2 <- disneyland_ride %>%
select_if(function(x) !all(x == 0))
#head(disneyland_ride2)
# Plot the number of times each ride is mentioned in reviews:
rides_sum <- disneyland_ride2 %>%
select(11:34) %>%
colSums()
rides_sum <- rides_sum[order(-rides_sum)]
barplot(rides_sum, main = "Number of Times Each Ride Being Mentioned", xlab = "Ride Names", ylab = "Count", col = "lightblue", density = 30, las = 2, cex.names = 0.6, ylim = c(0, 3500))
From the above plot, we see that “Space Mountain”, “Pirates of the Caribbean”, “Haunted Mansion”, “Star Tours”, and “Splash Mountain” are the five most frequently mentioned rides.
Given so many rides, we first analyze “Space Mountain”, as it is the ride mentioned most often in reviews.
# Keep only the reviews that mention Space Mountain
spaceMountain <- disneyland_ride2 %>%
filter(Space_Mountain == 1)
library(ggplot2)
plot1 <- ggplot(spaceMountain, aes(Rating)) +
geom_bar(stat="count", position = "dodge") +
ggtitle('Rating Distribution for Reviews Mentioned Ride Space Mountain')
plot1
# Keep only the negative reviews that mention Space Mountain
spaceMountain_Neg <- spaceMountain %>%
filter(Rating_type == "negative")
plot2 <- ggplot(spaceMountain_Neg, aes(Rating)) +
geom_bar(stat="count", position = "dodge") +
ggtitle('Rating Distribution for Negative Reviews Mentioned Ride Space Mountain')
plot2
# For all Space Mountain reviews (positive and negative), extract the first sentence containing "Space Mountain", regardless of letter case
spaceMountain <- spaceMountain %>%
mutate(Ride_Sentence = str_extract(Review_Text, "(?i)\\b[^.]*Space Mountain[^.]*\\b"))
#spaceMountain
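str_extract() returns only the first match, so a review that mentions the ride in several sentences keeps just one of them. As a minimal sketch of one alternative, using base R instead of stringr and a hypothetical helper name extract_ride_sentences(), every matching sentence could be collected as follows. The example reviews are made up, and the ride name is assumed to contain no regular-expression metacharacters.

```r
# Hypothetical helper: return every sentence-like span that mentions `ride`.
# As in the pattern above, a "sentence" is any run of characters without a full stop.
extract_ride_sentences <- function(text, ride) {
  pattern <- paste0("(?i)\\b[^.]*", ride, "[^.]*\\b")
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

# Made-up reviews for illustration.
toy_reviews <- c(
  "We arrived early. Then we rode Space Mountain twice. Lunch was fine.",
  "space mountain was closed. We came back later and Space Mountain had reopened.",
  "The parade was lovely."
)

sentences <- extract_ride_sentences(toy_reviews, "Space Mountain")

# One list element per review; reviews without a mention give character(0).
sentences
lengths(sentences)
```

Because only a full stop ends a span here, sentences ending in “!” or “?” run on into the next one, which is a limitation shared with the pattern used above.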
From the “Rating Distribution for Reviews Mentioned Ride Space
Mountain” plot, we can see that 5-star ratings are more common than
lower ratings among reviews mentioning “Space Mountain”. Overall,
visitors’ experience with Space Mountain is positive.
We extracted the sentence in Review_Text that mentions
the Space Mountain ride and stored it in the
Ride_Sentence column.
We then use the binary sentiment lexicon “bing” to
categorize the words in Ride_Sentence as positive or
negative.
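To illustrate what the lexicon join does, here is a minimal sketch in base R with a tiny stand-in lexicon. The four lexicon entries and the two sentences are hypothetical; the real “bing” lexicon is much larger.

```r
# Tiny stand-in for a binary sentiment lexicon (hypothetical entries).
toy_lexicon <- data.frame(
  word      = c("fun", "great", "broken", "boring"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Made-up ride sentences.
toy_sentences <- c(
  "Space Mountain was great fun",
  "Space Mountain was broken again"
)

# Lower-case the text and split on anything that is not a letter or apostrophe.
words <- unlist(strsplit(tolower(toy_sentences), "[^a-z']+"))
words <- words[words != ""]

# Keep only words found in the lexicon (the equivalent of an inner join)
# and count them per sentiment class.
matched <- merge(data.frame(word = words), toy_lexicon, by = "word")
sentiment_counts <- table(matched$sentiment)
sentiment_counts
```

Words that are not in the lexicon are dropped, and each word is scored on its own, so a phrase like “not fun” would still count as positive.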
library(tidytext)
as.data.frame(get_sentiments('bing'))[1:50,]
## word sentiment
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## 11 abound positive
## 12 abounds positive
## 13 abrade negative
## 14 abrasive negative
## 15 abrupt negative
## 16 abruptly negative
## 17 abscond negative
## 18 absence negative
## 19 absent-minded negative
## 20 absentee negative
## 21 absurd negative
## 22 absurdity negative
## 23 absurdly negative
## 24 absurdness negative
## 25 abundance positive
## 26 abundant positive
## 27 abuse negative
## 28 abused negative
## 29 abuses negative
## 30 abusive negative
## 31 abysmal negative
## 32 abysmally negative
## 33 abyss negative
## 34 accessable positive
## 35 accessible positive
## 36 accidental negative
## 37 acclaim positive
## 38 acclaimed positive
## 39 acclamation positive
## 40 accolade positive
## 41 accolades positive
## 42 accommodative positive
## 43 accomodative positive
## 44 accomplish positive
## 45 accomplished positive
## 46 accomplishment positive
## 47 accomplishments positive
## 48 accost negative
## 49 accurate positive
## 50 accurately positive
get_sentiments('bing')%>%
group_by(sentiment)%>%
count()
## # A tibble: 2 × 2
## # Groups: sentiment [2]
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
library(ggthemes)
spaceMountain %>%
group_by(Review_ID) %>%
unnest_tokens(output = word, input = Ride_Sentence)%>%
inner_join(get_sentiments('bing'))%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=sentiment,y=n,fill=sentiment))+
geom_col()+
theme_economist()+
guides(fill="none")+
coord_flip() +
labs(title = "Sentiment Analysis of All Reviews for Space Mountain") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
# We observe more positive words than negative words overall.
# Check whether this also holds in negative reviews:
spaceMountain_Neg <- spaceMountain %>%
filter(Rating_type == "negative")
spaceMountain_Neg %>%
group_by(Review_ID) %>%
unnest_tokens(output = word, input = Ride_Sentence)%>%
inner_join(get_sentiments('bing'))%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=sentiment,y=n,fill=sentiment))+
geom_col()+
theme_economist()+
guides(fill="none")+
coord_flip() +
labs(title = "Sentiment Analysis of Negative Reviews for Space Mountain") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
# Even in negatively rated reviews, the sentences mentioning 'Space Mountain' still contain more positive words than negative words.
# Compare review rating with sentiment:
library(ggthemes)
spaceMountain %>%
select(Review_ID, Ride_Sentence, Rating)%>%
group_by(Review_ID, Rating)%>%
unnest_tokens(output=word,input= Ride_Sentence)%>%
ungroup()%>%
inner_join(get_sentiments('bing'))%>%
group_by(Rating,sentiment)%>%
summarize(n = n())%>%
mutate(proportion = n/sum(n))%>%
ggplot(aes(x= Rating,y=proportion,fill=sentiment))+
geom_col()+
theme_economist()+
coord_flip() +
labs(title = "Sentiment Analysis of Reviews for Space Mountain in Different Rating Categories") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
From the “Sentiment Analysis of All Reviews for Space Mountain” plot, we see more positive than negative sentiment across all reviews. The “Sentiment Analysis of Negative Reviews for Space Mountain” plot also shows more positive than negative sentiment when we keep only negative reviews (rated 1 or 2), although the difference is smaller. The “Sentiment Analysis of Reviews for Space Mountain in Different Rating Categories” plot shows a similar pattern: both positive and negative sentiment appear in every rating category, with positive reviews (rated 4 or 5) showing a larger proportion of positive sentiment and negative reviews (rated 1 or 2) a larger proportion of negative sentiment.
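The per-rating comparison relies on proportions rather than raw counts, because the rating categories contain very different numbers of reviews. A minimal worked example with hypothetical counts (not our actual results) shows the calculation:

```r
# Hypothetical counts of lexicon-matched words by rating and sentiment.
counts <- matrix(
  c(40, 60,   # rating 1: negative, positive
    10, 90),  # rating 5: negative, positive
  nrow = 2, byrow = TRUE,
  dimnames = list(Rating = c("1", "5"), sentiment = c("negative", "positive"))
)

# Row-wise proportions: the shares within each rating sum to 1, which makes
# ratings with different numbers of matched words comparable.
proportions <- prop.table(counts, margin = 1)
proportions
```

In this made-up case both ratings contain more positive than negative words, yet the negative share is four times larger at rating 1 (0.4) than at rating 5 (0.1).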
Next, we look at the emotions expressed in reviews that mention the Space Mountain ride.
nrc = read.table(file = 'https://raw.githubusercontent.com/pseudorational/data/master/nrc_lexicon.txt',
header = F,
col.names = c('word','sentiment','num'),
sep = '\t',
stringsAsFactors = F)
nrc = nrc[nrc$num!=0,]
nrc$num = NULL
spaceMountain %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Ride_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
arrange(desc(n))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 4 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
## # A tibble: 10 × 2
## # Groups: sentiment [10]
## sentiment n
## <chr> <int>
## 1 anticipation 7348
## 2 positive 4774
## 3 negative 2444
## 4 joy 2409
## 5 trust 2171
## 6 fear 1938
## 7 surprise 1417
## 8 sadness 1086
## 9 anger 593
## 10 disgust 364
spaceMountain %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Ride_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
geom_col()+
guides(fill="none")+
coord_flip()+
theme_wsj() +
labs(title = "NRC Analysis of ALL Reviews Mentioned Space Mountain") +
theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 4 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
spaceMountain_Neg %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Ride_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
geom_col()+
guides(fill=F)+
coord_flip()+
theme_wsj() +
labs(title = "NRC Analysis of Negative Reviews Mentioned Space Mountain") +
theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 18 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
In the “NRC Analysis of ALL Reviews Mentioned Space Mountain” plot, “anticipation” and “positive” appear more frequently than “negative” or “fear”. In the “NRC Analysis of Negative Reviews Mentioned Space Mountain” plot, “anticipation” and “positive” are still common, but “fear” and “sadness” move up the ranking. This suggests that feelings toward Space Mountain remain mostly positive even among visitors who rated their overall Disneyland experience 1 or 2, while words expressing negative emotions are more frequent in negatively rated reviews.
We also analyze the Pirates of the Caribbean ride, since it has the second-highest number of mentions in reviews. With more time, we would analyze every ride one by one. For now we pick only the top two because their larger sample sizes make the results more reliable.
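The join warnings printed with the NRC code arise because a single word can carry several NRC labels, so one token can match several lexicon rows and is counted once per label. The sketch below illustrates this with a tiny hand-made lexicon (hypothetical entries, not the real NRC file) and base R's `merge()` standing in for `inner_join()`:

```r
# Toy tokens from one review sentence (hypothetical data).
tokens <- data.frame(word = c("thrill", "dark", "ride"),
                     stringsAsFactors = FALSE)

# Hand-made lexicon: entries are illustrative only, not copied from NRC.
toy_lexicon <- data.frame(
  word      = c("thrill", "thrill", "dark"),
  sentiment = c("anticipation", "fear", "sadness"),
  stringsAsFactors = FALSE
)

# merge() keeps only matching words, like inner_join(); "thrill" matches
# two lexicon rows, so it contributes two rows to the result.
joined <- merge(tokens, toy_lexicon, by = "word")

# Count rows per label, as group_by(sentiment) %>% count() does.
label_counts <- table(joined$sentiment)
print(joined)
print(label_counts)
```

The NRC totals therefore count word-label pairs rather than distinct words. That is the intended behaviour here, so the warning can be silenced as its own message suggests.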
rides_sum
## Space_Mountain Pirates Haunted_Mansion
## 3465 1293 1063
## Star_Tours Splash_Mountain Small_World
## 903 877 866
## Jungle_Cruise Peter_Pan Big_Thunder
## 337 126 118
## Dinosaur Dumbo Winnie_Pooh
## 29 28 24
## Toy_Story Mad_Tea Twilight_Zone
## 24 23 23
## Under_Sea Expedition_Everest Test_Track
## 18 14 10
## Astro_Orbiter Mission_Space Rock_Roller
## 9 5 5
## Soarin_Around Seven_Dwarfs Tomorrowland_Speedway
## 4 1 1
#keep reviews that mention Pirates of the Caribbean
pirates <- disneyland_ride2 %>%
filter(Pirates == 1)
#head(pirates)
library(ggplot2)
plot3 <- ggplot(pirates, aes(Rating)) +
geom_bar(stat="count", position = "dodge") +
ggtitle('Rating Distribution for Reviews Mentioned Ride pirates')
plot3
#filter out negative reviews mentioned pirates
pirates_Neg <- pirates %>%
filter(Rating_type == "negative")
plot4 <- ggplot(pirates_Neg, aes(Rating)) +
geom_bar(stat="count", position = "dodge") +
ggtitle('Rating Distribution for Negative Reviews Mentioned Ride Pirates')
plot4
#for all Pirates (pos and neg) reviews, detect the specific sentence containing Pirates regardless of letter case
pirates <- pirates %>%
mutate(Pirates_Sentence = str_extract(Review_Text, "(?i)\\b[^.]*pirates[^.]*\\b"))
#pirates
The “Rating Distribution for Reviews Mentioned Ride pirates” plot again shows the same rating pattern as all Disneyland reviews (whether or not they mention a ride) and as the Space Mountain reviews: reviews rated 5 are the most common and reviews rated 1 are the least common.
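The `Pirates_Sentence` extraction above relies on a pattern that treats only periods as sentence boundaries and keeps only the first match. The sketch below applies the same pattern to a made-up review (hypothetical text) with base R's `regexpr()` and `gregexpr()`, used here as stand-ins for `str_extract()` and `str_extract_all()`:

```r
# Hypothetical review text, used only to illustrate the pattern.
review <- "We arrived early. The Pirates ride was closed! We came back later. Pirates was great."

# Same pattern as in the report; perl = TRUE enables the inline (?i) flag.
pattern <- "(?i)\\b[^.]*pirates[^.]*\\b"

# regexpr() returns only the first match, as str_extract() does.
first_hit <- regmatches(review, regexpr(pattern, review, perl = TRUE))

# gregexpr() returns every match, comparable to str_extract_all().
all_hits <- regmatches(review, gregexpr(pattern, review, perl = TRUE))[[1]]

print(first_hit)
print(all_hits)
```

In this example the first match runs past the "!" because only "." ends the span, and the second mention of the ride appears only in `all_hits`. The sentiment counts are therefore based on the first period-delimited span that mentions the ride.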
pirates %>%
group_by(Review_ID) %>%
unnest_tokens(output = word, input = Pirates_Sentence)%>%
inner_join(get_sentiments('bing'))%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=sentiment,y=n,fill=sentiment))+
geom_col()+
theme_economist()+
guides(fill=F)+
coord_flip() +
labs(title = "Sentiment Analysis of All Reviews for Pirates") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
#observe more positive words than negative words overall
#see if this is true in negative reviews:
pirates_Neg <- pirates %>%
filter(Rating_type == "negative")
pirates_Neg %>%
group_by(Review_ID) %>%
unnest_tokens(output = word, input = Pirates_Sentence)%>%
inner_join(get_sentiments('bing'))%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=sentiment,y=n,fill=sentiment))+
geom_col()+
theme_economist()+
guides(fill=F)+
coord_flip() +
labs(title = "Sentiment Analysis of Negative Reviews for Pirates") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
#even in negatively rated reviews, there are still more positive words in the sentences mentioning Pirates.
#compare review rating and sentiment
library(ggthemes)
pirates %>%
select(Review_ID, Pirates_Sentence, Rating)%>%
group_by(Review_ID, Rating)%>%
unnest_tokens(output=word,input= Pirates_Sentence)%>%
ungroup()%>%
inner_join(get_sentiments('bing'))%>%
group_by(Rating,sentiment)%>%
summarize(n = n())%>%
mutate(proportion = n/sum(n))%>%
ggplot(aes(x= Rating,y=proportion,fill=sentiment))+
geom_col()+
theme_economist()+
coord_flip() +
labs(title = "Sentiment Analysis of All Reviews for Pirates Across All Rating Categories") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
We observe more positive than negative sentiment words in both the
“Sentiment Analysis of All Reviews for Pirates” plot and the “Sentiment
Analysis of Negative Reviews for Pirates” plot. This suggests that the
experience with the Pirates of the Caribbean ride is generally described
positively even when visitors gave a negative overall rating.
The “Sentiment Analysis of All Reviews for Pirates Across All Rating
Categories” plot shows a different proportion pattern from Space
Mountain. In every rating category there is more positive than negative
sentiment. Reviews rated 5 still have the highest proportion of positive
sentiment, but reviews rated 2 have a higher proportion of positive
sentiment than reviews rated 3.
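The proportions in the rating-category plots are computed within each rating, not across the whole sample. A minimal base R sketch on made-up tokens (hypothetical data) shows the same calculation that `summarize(n = n())` followed by `mutate(proportion = n/sum(n))` performs:

```r
# Hypothetical matched tokens: the rating of the review each word came from
# and the bing label of the word.
toy <- data.frame(
  Rating    = c(1, 1, 1, 5, 5, 5, 5),
  sentiment = c("negative", "negative", "positive",
                "positive", "positive", "positive", "negative")
)

# Counts per (Rating, sentiment) pair, like summarize(n = n()).
counts <- table(toy$Rating, toy$sentiment)

# margin = 1 divides each row by its row total, so proportions are computed
# within each rating, matching mutate(proportion = n / sum(n)) on data that
# is still grouped by Rating.
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Each bar therefore sums to 1, so the plot compares the sentiment mix across ratings rather than the number of words in each rating.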
pirates %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Pirates_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
arrange(desc(n))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 12 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
## # A tibble: 10 × 2
## # Groups: sentiment [10]
## sentiment n
## <chr> <int>
## 1 positive 1672
## 2 anticipation 1557
## 3 negative 1009
## 4 joy 856
## 5 trust 749
## 6 fear 723
## 7 sadness 504
## 8 surprise 471
## 9 anger 194
## 10 disgust 116
pirates %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Pirates_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
geom_col()+
guides(fill=F)+
coord_flip()+
theme_wsj() +
labs(title = "NRC Analysis of ALL Reviews Mentioned Pirates") +
theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 12 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
pirates_Neg %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Pirates_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
geom_col()+
guides(fill=F)+
coord_flip()+
theme_wsj() +
labs(title = "NRC Analysis of Negative Reviews Mentioned Pirates") +
theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 7 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
In the “NRC Analysis of ALL Reviews Mentioned Pirates” plot, “positive” and “anticipation” are at the top of the ranking, followed by “negative”. We observe a similar ranking in the “NRC Analysis of Negative Reviews Mentioned Pirates” plot.
We want to combine the ride features in the “rides” dataset with our “disneyland_ride2” dataset so we can explore whether ride features are related to rating. We keep only reviews that mention exactly one ride, because when a review mentions more than one ride we cannot tell which ride's features to match to the rating.
#head(disneyland_ride2,200)
#keep reviews that mention at least one ride
disneyland_ride3 <- disneyland_ride2 %>%
filter(rowSums(disneyland_ride2[, 11:34]) != 0)
#keep reviews that mention exactly one ride
disneyland_ride4 <- disneyland_ride3[rowSums(disneyland_ride3[, 11:34] == 1) == 1, ]
#head(disneyland_ride4)
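The exactly-one-ride filter above can be illustrated on a small made-up table. The sketch below uses hypothetical reviews and selects the indicator columns by name instead of by position (an assumption for readability; the report uses columns 11:34):

```r
# Hypothetical indicator columns: 1 if the review mentions the ride.
toy <- data.frame(
  Review_ID      = 1:4,
  Space_Mountain = c(1, 1, 0, 0),
  Pirates        = c(0, 1, 0, 1),
  Small_World    = c(0, 0, 0, 0)
)
ride_cols <- c("Space_Mountain", "Pirates", "Small_World")

# Number of rides mentioned in each review.
n_rides <- rowSums(toy[, ride_cols] == 1)

# Keep reviews with exactly one ride; this also drops reviews with none.
one_ride <- toy[n_rides == 1, ]

# Name of the single ride per kept review: max.col() gives the position of
# the 1 in each row (unambiguous here because each row has exactly one).
one_ride$Ride_Mentioned <- ride_cols[max.col(as.matrix(one_ride[, ride_cols]),
                                             ties.method = "first")]
print(one_ride)
```

Review 2 mentions two rides and review 3 mentions none, so only reviews 1 and 4 remain.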
#create a new column holding the name of the ride mentioned in the review
disneyland_ride4 <- disneyland_ride4 %>%
mutate(Ride_Mentioned = case_when(
`Astro_Orbiter` == 1 ~ "Astro Orbiter",
`Big_Thunder` == 1 ~ "Big Thunder Mountain Railroad",
`Dinosaur` == 1 ~ "Dinosaur",
`Dumbo` == 1 ~ "Dumbo the Flying Elephant",
`Expedition_Everest` == 1 ~ "Expedition Everest",
`Haunted_Mansion` == 1 ~ "Haunted Mansion",
`Small_World` == 1 ~ "It's a Small World",
`Jungle_Cruise` == 1 ~ "Jungle Cruise",
`Mad_Tea` == 1 ~ "Mad Tea Party",
`Mission_Space` == 1 ~ "Mission Space",
`Peter_Pan` == 1 ~ "Peter Pan's Flight",
`Pirates` == 1 ~ "Pirates of the Caribbean",
`Rock_Roller` == 1 ~ "Rock 'n' Roller Coaster",
`Seven_Dwarfs` == 1 ~ "Seven Dwarfs Mine Train",
`Soarin_Around` == 1 ~ "Soarin' Around the World",
`Space_Mountain` == 1 ~ "Space Mountain",
`Splash_Mountain` == 1 ~ "Splash Mountain",
`Star_Tours` == 1 ~ "Star Tours",
`Test_Track` == 1 ~ "Test Track",
`Winnie_Pooh` == 1 ~ "The Many Adventures of Winnie the Pooh",
`Twilight_Zone` == 1 ~ "The Twilight Zone Tower of Terror",
`Tomorrowland_Speedway` == 1 ~ "Tomorrowland Speedway",
`Toy_Story` == 1 ~ "Toy Story Mania",
`Under_Sea` == 1 ~ "Under the Sea",
TRUE ~ NA_character_
))
#drop irrelevant columns:
disneyland_ride4 <- disneyland_ride4 %>%
select(-c(11:34))
#head(disneyland_ride4,100)
#combine the two datasets
#head(rides)
rides_new <- rides[, -c(2, 3)]
disneyland_ridefull <- disneyland_ride4 %>%
left_join(rides_new, by = c("Ride_Mentioned" = "Ride_name"))
#disneyland_ridefull
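A left join keeps every review even when its `Ride_Mentioned` label has no exact counterpart in the feature table, in which case the feature columns are filled with NA. As a sketch on made-up tables (hypothetical stand-ins for `disneyland_ride4` and `rides_new`), one way to list labels that failed to match is:

```r
# Hypothetical stand-ins for disneyland_ride4 and rides_new.
reviews <- data.frame(
  Review_ID      = 1:3,
  Ride_Mentioned = c("Space Mountain", "Mad Tea Party", "Pirates of the Caribbean"),
  stringsAsFactors = FALSE
)
features <- data.frame(
  Ride_name        = c("Space Mountain", "Pirates of the Caribbean"),
  Ride_type_thrill = c(1, 0),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every review, like left_join(); reviews whose label has
# no counterpart in the feature table get NA in the feature columns.
joined <- merge(reviews, features,
                by.x = "Ride_Mentioned", by.y = "Ride_name", all.x = TRUE)

# Labels that found no match, e.g. because of a spelling difference.
unmatched <- unique(joined$Ride_Mentioned[is.na(joined$Ride_type_thrill)])
print(unmatched)
```

An empty result from such a check would indicate that every ride label was matched to its features.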
R_thrill <- disneyland_ridefull %>%
filter(Ride_type_thrill == 1)
plot_thrill <- ggplot(R_thrill, aes(Rating)) +
geom_bar(stat="count", position = "dodge")
plot_thrill
ggplot(data = disneyland_ridefull, aes(Rating, fill = Ride_type_all)) +
geom_bar(stat = "count", position = "stack") +
labs(title = "Count of Ride Ratings by Ride Type Across All Rating Categories") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
From the above plot, we observe that every rating category has a
similar distribution of ride_type_all. Specifically, rides that are
“thrill, big drops, dark” make up the largest share in all rating
categories.
To look at the ride features individually, we separate them and check
whether each one is associated with rating.
#Thrill
thrill_prop <- disneyland_ridefull %>%
group_by(Rating, Ride_type_thrill) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
ggplot(thrill_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_thrill))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink"),
name = "If Ride is Thrill",
labels = c("Not Thrilling", "Thrilling")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Thrillness") +
theme_bw()
#head(disneyland_ridefull)
#Spinning
spin_prop <- disneyland_ridefull %>%
group_by(Rating, Ride_type_spinning) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
spin_prop
## # A tibble: 10 × 4
## # Groups: Rating [5]
## Rating Ride_type_spinning n prop
## <int> <dbl> <int> <dbl>
## 1 1 0 111 0.991
## 2 1 1 1 0.00893
## 3 2 0 203 0.995
## 4 2 1 1 0.00490
## 5 3 0 547 0.995
## 6 3 1 3 0.00545
## 7 4 0 1142 0.996
## 8 4 1 5 0.00436
## 9 5 0 1956 0.991
## 10 5 1 17 0.00862
ggplot(spin_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_spinning))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink"),
name = "If Ride is Spinning",
labels = c("Not Spinning", "Spinning")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Spinningness") +
theme_bw()
#head(disneyland_ridefull)
#Slow
slow_prop <- disneyland_ridefull %>%
group_by(Rating, Ride_type_slow) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
slow_prop
## # A tibble: 10 × 4
## # Groups: Rating [5]
## Rating Ride_type_slow n prop
## <int> <dbl> <int> <dbl>
## 1 1 0 73 0.652
## 2 1 1 39 0.348
## 3 2 0 136 0.667
## 4 2 1 68 0.333
## 5 3 0 366 0.665
## 6 3 1 184 0.335
## 7 4 0 773 0.674
## 8 4 1 374 0.326
## 9 5 0 1292 0.655
## 10 5 1 681 0.345
ggplot(slow_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_slow))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink"),
name = "Speed of Ride",
labels = c("Quick", "Slow")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Speed") +
theme_bw()
#head(disneyland_ridefull)
#check if Small Drops and Big Drops complement each other
all(disneyland_ridefull$Ride_type_small_drops == !disneyland_ridefull$Ride_type_big_drops)
## [1] FALSE
#They do not complement each other -> might be rides with no drops at all
disneyland_ridefull <- disneyland_ridefull %>%
mutate(Ride_type_drop = if_else(Ride_type_small_drops == 1, 1,
if_else(Ride_type_big_drops == 1, 2, 0)))
drop_prop <- disneyland_ridefull %>%
group_by(Rating, Ride_type_drop) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
drop_prop
## # A tibble: 15 × 4
## # Groups: Rating [5]
## Rating Ride_type_drop n prop
## <int> <dbl> <int> <dbl>
## 1 1 0 20 0.179
## 2 1 1 33 0.295
## 3 1 2 59 0.527
## 4 2 0 40 0.196
## 5 2 1 45 0.221
## 6 2 2 119 0.583
## 7 3 0 108 0.196
## 8 3 1 119 0.216
## 9 3 2 323 0.587
## 10 4 0 238 0.207
## 11 4 1 220 0.192
## 12 4 2 689 0.601
## 13 5 0 459 0.233
## 14 5 1 388 0.197
## 15 5 2 1126 0.571
ggplot(drop_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_drop))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink", "mediumpurple"),
name = "If Ride has Drops",
labels = c("No Drops", "Small Drops", "Big Drops")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Drop Type") +
theme_bw()
#head(disneyland_ridefull)
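The drop recoding above combines two flags into one three-level code. The sketch below reproduces the complement check and the recoding on made-up flags (hypothetical rides), with base R's `ifelse()` standing in for `if_else()`:

```r
# Hypothetical flags for four rides.
small_drops <- c(0, 1, 0, 1)
big_drops   <- c(0, 0, 1, 1)

# The two flags are complements only if exactly one is set for every ride.
are_complements <- all(small_drops == !big_drops)

# Same precedence as the nested if_else(): small drops are checked first,
# so a ride flagged for both would be coded 1, not 2.
drop_type <- ifelse(small_drops == 1, 1, ifelse(big_drops == 1, 2, 0))
print(are_complements)
print(drop_type)
```

The first ride has neither flag set, which is the case that makes the complement check fail, and the last ride shows that the small-drops flag takes precedence when both are set.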
#Dark
dark_prop <- disneyland_ridefull %>%
group_by(Rating, Ride_type_dark) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
dark_prop
## # A tibble: 10 × 4
## # Groups: Rating [5]
## Rating Ride_type_dark n prop
## <int> <dbl> <int> <dbl>
## 1 1 0 34 0.304
## 2 1 1 78 0.696
## 3 2 0 53 0.260
## 4 2 1 151 0.740
## 5 3 0 148 0.269
## 6 3 1 402 0.731
## 7 4 0 325 0.283
## 8 4 1 822 0.717
## 9 5 0 591 0.300
## 10 5 1 1382 0.700
ggplot(dark_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_dark))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink"),
name = "If Ride is Dark",
labels = c("Not Dark", "Dark")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Darkness") +
theme_bw()
#head(disneyland_ridefull)
#scary
scary_prop <- disneyland_ridefull %>%
group_by(Rating, Ride_type_scary) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
scary_prop
## # A tibble: 10 × 4
## # Groups: Rating [5]
## Rating Ride_type_scary n prop
## <int> <dbl> <int> <dbl>
## 1 1 0 111 0.991
## 2 1 1 1 0.00893
## 3 2 0 200 0.980
## 4 2 1 4 0.0196
## 5 3 0 548 0.996
## 6 3 1 2 0.00364
## 7 4 0 1142 0.996
## 8 4 1 5 0.00436
## 9 5 0 1964 0.995
## 10 5 1 9 0.00456
ggplot(scary_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_scary))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink"),
name = "If Ride is Scary",
labels = c("Not Scary", "Scary")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Scariness") +
theme_bw()
#head(disneyland_ridefull)
#water
water_prop <- disneyland_ridefull %>%
group_by(Rating, Ride_type_water) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
water_prop
## # A tibble: 10 × 4
## # Groups: Rating [5]
## Rating Ride_type_water n prop
## <int> <dbl> <int> <dbl>
## 1 1 0 103 0.920
## 2 1 1 9 0.0804
## 3 2 0 190 0.931
## 4 2 1 14 0.0686
## 5 3 0 509 0.925
## 6 3 1 41 0.0745
## 7 4 0 1071 0.934
## 8 4 1 76 0.0663
## 9 5 0 1811 0.918
## 10 5 1 162 0.0821
ggplot(water_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_water))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink"),
name = "If Ride has Water",
labels = c("No Water", "Water")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on if Rides has Water") +
theme_bw()
#head(disneyland_ridefull)
#if can use fast pass
table(disneyland_ridefull$Fast_pass)
##
## 0 1
## 2 3984
# Almost all rides offer fast pass, so we skip this feature
table(disneyland_ridefull$Classic)
##
## 0 1
## 633 3353
#classic
classic_prop <- disneyland_ridefull %>%
group_by(Rating, Classic) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
classic_prop
## # A tibble: 10 × 4
## # Groups: Rating [5]
## Rating Classic n prop
## <int> <dbl> <int> <dbl>
## 1 1 0 25 0.223
## 2 1 1 87 0.777
## 3 2 0 35 0.172
## 4 2 1 169 0.828
## 5 3 0 81 0.147
## 6 3 1 469 0.853
## 7 4 0 165 0.144
## 8 4 1 982 0.856
## 9 5 0 327 0.166
## 10 5 1 1646 0.834
ggplot(classic_prop, aes(x = Rating, y = prop, fill = factor(Classic))) +
geom_col(position = "stack") +
scale_fill_manual(values = c("lightblue", "pink"),
name = "If Ride is Classic",
labels = c("No", "Yes")) +
labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on if Rides is Classic") +
theme_bw()
In all of the above plots, whatever the ride feature, the split within each rating category’s bar is nearly the same. This suggests that the ride features are not associated with ratings. Take “Proportion of Ratings based on Rides’ Speed” as an example: if the share of quick rides grew as the rating increased, that could indicate that quick rides receive higher ratings than slow rides. Instead, the split is about the same across all ratings, so it mainly reflects the overall mix of quick and slow rides in Disneyland. Likewise, the spinning plot shows that there are more non-spinning rides in the park, so naturally more reviews mention non-spinning rides in every rating category.
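The visual comparison of the bars can be complemented with a formal check. As a sketch, the counts printed in the `slow_prop` output can be fed to a chi-squared test of independence. This test is added here for illustration and is not part of the original analysis:

```r
# Counts copied from the slow_prop output above (rows: rating 1 to 5).
slow_counts <- matrix(
  c(73, 136, 366, 773, 1292,   # Ride_type_slow == 0
    39,  68, 184, 374,  681),  # Ride_type_slow == 1
  ncol = 2,
  dimnames = list(Rating = as.character(1:5), Slow = c("0", "1"))
)

# Pearson chi-squared test of independence between rating and the flag.
slow_test <- chisq.test(slow_counts)
print(slow_test)
```

The same call could be repeated for the other feature tables. Features with very small cell counts, such as spinning or scary, may trigger an approximation warning.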
ggplot(data = disneyland_ridefull, aes(Rating, fill = Age_interest_all)) +
geom_bar(stat = "count", position = "stack") +
labs(title = "Count of Ride Ratings by Ride Age Interest Group Across All Rating Categories") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
The “kids, teens, teens, adults” group only excludes “preschoolers” from the “all ages” group. All rating categories have a similar age-group distribution.
linear_model <- lm(Rating ~ Ride_type_thrill + Ride_type_slow + Ride_type_spinning + Ride_type_drop + Ride_type_dark + Ride_type_scary + Ride_type_water + Classic + Height_req_inches + Ride_duration_min, data = disneyland_ridefull)
summary(linear_model)
##
## Call:
## lm(formula = Rating ~ Ride_type_thrill + Ride_type_slow + Ride_type_spinning +
## Ride_type_drop + Ride_type_dark + Ride_type_scary + Ride_type_water +
## Classic + Height_req_inches + Ride_duration_min, data = disneyland_ridefull)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2531 -0.2460 -0.0635 0.8347 1.2708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88812 0.47108 8.254 < 2e-16 ***
## Ride_type_thrill 0.11500 0.24555 0.468 0.63957
## Ride_type_slow 0.25022 0.40990 0.610 0.54159
## Ride_type_spinning 0.23000 0.29827 0.771 0.44068
## Ride_type_drop -0.19279 0.06729 -2.865 0.00419 **
## Ride_type_dark 0.01647 0.07547 0.218 0.82728
## Ride_type_scary -0.32794 0.32678 -1.004 0.31566
## Ride_type_water 0.15925 0.28939 0.550 0.58216
## Classic 0.07509 0.22615 0.332 0.73988
## Height_req_inches 0.01019 0.01059 0.962 0.33589
## Ride_duration_min 0.00310 0.02182 0.142 0.88701
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 3975 degrees of freedom
## Multiple R-squared: 0.004076, Adjusted R-squared: 0.001571
## F-statistic: 1.627 on 10 and 3975 DF, p-value: 0.09263
plot(linear_model)
We built a multiple linear regression model to see whether ride
features are related to the overall Disneyland rating. Only
Ride_type_drop has a small p-value (0.004), so for that coefficient we
reject the null hypothesis of no association. Its estimate is negative
(-0.193), meaning that reviews mentioning rides with bigger drops tend to
come with slightly lower ratings, and rides with no drops with slightly
higher ones. For the other ride features the p-values are large, so we
fail to reject the null hypothesis, which means we found no evidence of
an association rather than proof that none exists. The model as a whole
explains very little: R-squared is about 0.004, and the overall F-test
(p = 0.093) is not significant at the 5% level, so the drop effect should
be read with caution.
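The p-values quoted from the regression summary can be reproduced from the printed statistics. The sketch below uses only numbers copied from the `summary(linear_model)` output, so small differences from the printed values are due to rounding:

```r
# Values copied from the summary(linear_model) output above.
f_stat <- 1.627; df1 <- 10; df2 <- 3975
est_drop <- -0.19279; se_drop <- 0.06729

# Overall F-test: upper-tail probability of the F distribution.
p_overall <- pf(f_stat, df1, df2, lower.tail = FALSE)

# Coefficient t-test: estimate over standard error, two-sided p-value.
t_drop <- est_drop / se_drop
p_drop <- 2 * pt(abs(t_drop), df = df2, lower.tail = FALSE)

print(c(p_overall = p_overall, t_drop = t_drop, p_drop = p_drop))
```

Reading the two tests together matters: one coefficient out of ten falling below 0.05 is less persuasive when the joint test of all coefficients does not.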
In the tf/tfidf graphs, many reviews mention “staff”. We would like to explore these reviews, and in particular the sentence that mentions staff (as part of the customer experience description), to understand the relationship between mentioning staff and the overall Disneyland rating.
library(dplyr)
library(stringr)
#create a new column Staff: 1 if Review_Text contains "staff" (any letter case), otherwise 0
disneyland_staff <- disneyland %>%
mutate(Staff = case_when(grepl("staff", Review_Text, ignore.case = TRUE) ~ 1,
TRUE ~ 0))
#keep rows that mention staff in Review_Text
staff <- disneyland_staff %>%
filter(Staff == 1)
#head(staff)
#find the exact sentence in Review_Text column that mentioned about staff
staff <- staff %>%
mutate(Staff_Sentence = str_extract(Review_Text, "(?i)\\b[^.]*Staff[^.]*\\b"))
plot_staff <- ggplot(staff, aes(Rating)) +
geom_bar(stat="count", position = "dodge") +
ggtitle('Rating Distribution for Reviews Mentioned Staff')
plot_ratingall <- ggplot(disneyland, aes(Rating)) +
geom_bar(stat="count", position = "dodge") +
ggtitle('Rating Distribution for All Reviews')
plot_staff
plot_ratingall
table(staff$Rating)
##
## 1 2 3 4 5
## 397 514 903 1313 2520
table(disneyland$Rating)
##
## 1 2 3 4 5
## 1338 1929 4782 10086 21908
staff_prop <- table(staff$Rating)/table(disneyland$Rating)
colors <- c("lightcoral", "tan1", "wheat", "darkseagreen2", "steelblue")
barplot(staff_prop, main = "The Proportion of Reviews Mentioned Staff", xlab = "Rating in All Reviews", ylab = "Proportion", col = colors)
The “Rating Distribution for Reviews Mentioned Staff” plot shares the same pattern as “Rating Distribution for All Reviews”: rating 5 has the most reviews, and the count decreases as the rating decreases. However, the share of reviews that mention staff rises as the rating falls, from about 12% of reviews rated 5 to about 30% of reviews rated 1.
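The shares behind the “The Proportion of Reviews Mentioned Staff” bar plot can be reproduced from the two tables printed above. A minimal sketch using only those printed counts:

```r
# Counts copied from the two table() outputs above (ratings 1 to 5).
staff_counts <- c(`1` = 397, `2` = 514, `3` = 903, `4` = 1313, `5` = 2520)
all_counts   <- c(`1` = 1338, `2` = 1929, `3` = 4782, `4` = 10086, `5` = 21908)

# Share of reviews in each rating category that mention staff.
staff_share <- staff_counts / all_counts
print(round(staff_share, 3))
```

Dividing by the per-rating totals removes the effect of rating 5 simply having far more reviews, which is why this view can differ from the raw count plot.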
staff %>%
group_by(Review_ID) %>%
unnest_tokens(output = word, input = Staff_Sentence)%>%
inner_join(get_sentiments('bing'))%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=sentiment,y=n,fill=sentiment))+
geom_col()+
theme_economist()+
guides(fill=F)+
coord_flip() +
labs(title = "Sentiment Analysis of All Reviews for Staff") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
#observe more positive words than negative words overall
#see if this is true in negative reviews:
staff_Neg <- staff %>%
filter(Rating_type == "negative")
staff_Neg %>%
group_by(Review_ID) %>%
unnest_tokens(output = word, input = Staff_Sentence)%>%
inner_join(get_sentiments('bing'))%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=sentiment,y=n,fill=sentiment))+
geom_col()+
theme_economist()+
guides(fill=F)+
coord_flip() +
labs(title = "Sentiment Analysis of Negative Reviews for Staff") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
#in negatively rated reviews, there are more negative words than positive words
#compare review rating and sentiment
library(ggthemes)
staff %>%
select(Review_ID, Staff_Sentence, Rating)%>%
group_by(Review_ID, Rating)%>%
unnest_tokens(output=word,input= Staff_Sentence)%>%
ungroup()%>%
inner_join(get_sentiments('bing'))%>%
group_by(Rating,sentiment)%>%
summarize(n = n())%>%
mutate(proportion = n/sum(n))%>%
ggplot(aes(x= Rating,y=proportion,fill=sentiment))+
geom_col()+
theme_economist()+
coord_flip() +
labs(title = "Sentiment Analysis of All Reviews for Staff Across All Rating Categories") +
theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
From the above graphs, we again observe more positive than negative
sentiment overall in the sentences that mention staff. However, in
negatively rated reviews (rated 1 or 2), negative sentiment exceeds
positive sentiment. This suggests that negative descriptions of staff
may be associated with lower ratings.
In every rating category, both positive and negative sentiment appear,
and higher ratings have a higher proportion of positive sentiment.
staff %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Staff_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
arrange(desc(n))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 3 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
## # A tibble: 10 × 2
## # Groups: sentiment [10]
## sentiment n
## <chr> <int>
## 1 positive 10505
## 2 joy 7263
## 3 trust 7209
## 4 anticipation 5508
## 5 negative 3074
## 6 surprise 2221
## 7 sadness 1635
## 8 fear 1353
## 9 anger 1164
## 10 disgust 965
staff %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Staff_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
geom_col()+
guides(fill=F)+
coord_flip()+
theme_wsj() +
labs(title = "NRC Analysis of ALL Reviews Mentioned Staff") +
theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 3 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
staff_Neg %>%
group_by(Review_ID)%>%
unnest_tokens(output = word, input = Staff_Sentence)%>%
inner_join(nrc)%>%
group_by(sentiment)%>%
count()%>%
ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
geom_col()+
guides(fill=F)+
coord_flip()+
theme_wsj() +
labs(title = "NRC Analysis of Negative Reviews Mentioned Staff") +
theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 19 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
In the “NRC Analysis of ALL Reviews Mentioned Staff” graph, the top four emotions are all positive. But in the “NRC Analysis of Negative Reviews Mentioned Staff” graph, “negative” moves up to rank 2. This again suggests that lower-rated reviews describe staff more negatively.
# Add review_rating back to dataframe of features
disneyland_data = cbind(rating = disneyland$Rating,xdtm1)
disneyland_data_tfidf = cbind(rating = disneyland$Rating,xdtm_tfidf1)
head(disneyland_data)
## rating busier day disneyland ever feel find hong kong main one queue ride
## 1 4 1 1 2 1 1 1 1 1 1 1 1 1
## 2 4 0 0 1 0 2 0 0 0 1 0 0 0
## 3 4 1 1 0 0 0 1 0 0 1 0 1 1
## 4 4 0 0 2 0 0 0 0 0 0 0 0 0
## 5 4 0 0 1 0 0 0 1 1 0 0 0 0
## 6 3 0 0 5 0 1 0 1 1 0 1 0 2
## small street visit walk well world worth also area attract bit disney dont
## 1 1 1 1 1 1 1 1 0 0 0 0 0 0
## 2 0 1 1 0 0 0 0 1 1 2 1 2 1
## 3 1 0 1 0 0 0 0 0 0 2 1 0 0
## 4 0 0 1 0 0 0 0 0 1 0 1 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 2 0 0 0 1 1 0 0 0 1 0 4 1
## especial even expect experiance good got great just last less like member
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 3 1 1 3 1
## 3 0 2 1 0 1 0 1 0 1 0 0 0
## 4 0 0 0 0 0 0 1 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 1 0
## 6 0 1 0 0 2 0 0 2 0 0 0 0
## mountain now open park place realli seem since somethin staff star stay theme
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 1 2 1 2 1 1 1 1 1 1 1 1 2
## 3 0 0 0 2 0 3 0 0 2 0 0 0 0
## 4 0 0 0 1 0 1 0 0 0 0 0 0 0
## 5 0 0 0 0 0 1 0 0 0 0 0 0 0
## 6 0 0 0 1 0 1 0 0 0 0 0 0 0
## time whole amazaaaaah.. around arrival big castle close enjoy everyone food
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 2 1 0 0 0 0 0 0 0 0 0
## 3 2 0 1 1 1 1 1 1 1 1 1
## 4 0 0 0 0 0 0 1 1 0 0 1
## 5 0 0 0 1 0 0 0 0 0 0 0
## 6 0 0 0 1 0 0 0 0 0 0 4
## hour lot minut. much parad quit shop way will can crowd drink kid love pay
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 1 1 1 1 1 2 2 1 1 0 0 0 0 0 0
## 4 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1
## 5 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0
## 6 0 0 0 1 1 0 1 1 0 1 1 0 0 0 0
## price work everythig took children expense fast however line managable never
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 1 1 0 0 0 0 0 0 0 0 0
## 5 0 0 1 1 0 0 0 0 0 0 0
## 6 0 0 0 0 1 3 1 1 1 1 1
## peopl see show take ticket tri water bad daughter know though went best
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 2 2 1 1 1 1 3 0 0 0 0 0 0
## disappoint little magic plan restaurant servicable. think week charactars.
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## eat enough fantastci fun get money photo train want better come holiday miss
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## must save say start two florida still mania space spend spent back earlier
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## firework night young familiar made age help need look min recommend wait pass
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## half smaller definitaley book mickey ablaze california comparable didnt first
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## make next. nice sure year adult beauties buy new old wonder land differ clean
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## high trip end found hotel light bring everithing although pirate thing use
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0
## full right happier without alway long friend pariah. part meet give watch
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0
## return least anothe adventure cant
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
head(disneyland_data_tfidf)
## rating busier day disneyland ever feel find hong
## 1 4 3.506264 1.139515 2.261463 4.270367 3.566882 3.720356 4.112466
## 2 4 0.000000 0.000000 1.130732 0.000000 7.133764 0.000000 0.000000
## 3 4 3.506264 1.139515 0.000000 0.000000 0.000000 3.720356 0.000000
## 4 4 0.000000 0.000000 2.261463 0.000000 0.000000 0.000000 0.000000
## 5 4 0.000000 0.000000 1.130732 0.000000 0.000000 0.000000 4.112466
## 6 3 0.000000 0.000000 5.653658 0.000000 3.566882 0.000000 4.112466
## kong main one queue ride small street visit
## 1 4.124356 3.504219 1.705005 2.55117 0.8469021 2.944967 4.118086 1.685868
## 2 0.000000 3.504219 0.000000 0.00000 0.0000000 0.000000 4.118086 1.685868
## 3 0.000000 3.504219 0.000000 2.55117 0.8469021 2.944967 0.000000 1.685868
## 4 0.000000 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 1.685868
## 5 4.124356 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 0.000000
## 6 4.124356 0.000000 1.705005 0.00000 1.6938041 5.889933 0.000000 0.000000
## walk well world worth also area attract bit
## 1 2.931985 2.639333 2.782955 2.875106 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000 0.000000 2.584494 3.758368 5.400159 3.593599
## 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.400159 3.593599
## 4 0.000000 0.000000 0.000000 0.000000 0.000000 3.758368 0.000000 3.593599
## 5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 6 0.000000 2.639333 2.782955 0.000000 0.000000 0.000000 2.700079 0.000000
## disney dont especial even expect experiance good got
## 1 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000
## 2 2.367199 2.828039 3.909884 2.289439 3.09882 2.556671 2.187123 3.05265
## 3 0.000000 0.000000 0.000000 4.578877 3.09882 0.000000 2.187123 0.00000
## 4 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000
## 5 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000
## 6 4.734398 2.828039 0.000000 2.289439 0.00000 0.000000 4.374245 0.00000
## great just last less like member mountain now
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 1.849821 5.965014 3.597083 4.008492 6.601919 4.190572 3.040742 7.866722
## 3 1.849821 0.000000 3.597083 0.000000 0.000000 0.000000 0.000000 0.000000
## 4 1.849821 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 5 0.000000 0.000000 0.000000 0.000000 2.200640 0.000000 0.000000 0.000000
## 6 0.000000 3.976676 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## open park place realli seem since somethin staff
## 1 0.000000 0.0000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
## 2 3.228567 1.7334754 1.875502 2.336521 3.713251 4.02482 4.021891 2.874313
## 3 0.000000 1.7334754 0.000000 7.009564 0.000000 0.00000 8.043781 0.000000
## 4 0.000000 0.8667377 0.000000 2.336521 0.000000 0.00000 0.000000 0.000000
## 5 0.000000 0.0000000 0.000000 2.336521 0.000000 0.00000 0.000000 0.000000
## 6 0.000000 0.8667377 0.000000 2.336521 0.000000 0.00000 0.000000 0.000000
## star stay theme time whole amazaaaaah.. around arrival
## 1 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 4.031875 2.91912 6.960597 2.137617 3.799916 0.000000 0.000000 0.000000
## 3 0.000000 0.00000 0.000000 2.137617 0.000000 2.995216 2.714669 4.151631
## 4 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 5 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 2.714669 0.000000
## 6 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 2.714669 0.000000
## big castle close enjoy everyone food hour lot
## 1 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 3 3.224856 3.88785 2.807211 2.237374 3.753987 2.064658 2.621042 2.444361
## 4 0.000000 3.88785 2.807211 0.000000 0.000000 2.064658 0.000000 0.000000
## 5 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 2.621042 0.000000
## 6 0.000000 0.00000 0.000000 0.000000 0.000000 8.258631 0.000000 0.000000
## minut. much parad quit shop way will can
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 3 2.966208 2.328175 2.246748 7.538270 7.451179 3.175195 2.271758 0.000000
## 4 0.000000 0.000000 0.000000 3.769135 3.725589 0.000000 2.271758 1.946493
## 5 0.000000 2.328175 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 6 0.000000 2.328175 2.246748 0.000000 3.725589 3.175195 0.000000 1.946493
## crowd drink kid love pay price work everythig
## 1 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
## 3 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
## 4 2.635074 4.016633 2.046386 2.01212 4.303421 3.196505 3.761297 0.000000
## 5 2.635074 0.000000 2.046386 0.00000 0.000000 0.000000 0.000000 3.109975
## 6 2.635074 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
## took children expense fast however line managable never
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 5 3.637314 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 6 0.000000 2.936391 8.878398 2.634403 3.772577 2.259975 4.142058 3.361114
## peopl see show take ticket tri water bad daughter know
## 1 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000 0.00000 0 0 0
## 2 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000 0.00000 0 0 0
## 3 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000 0.00000 0 0 0
## 4 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000 0.00000 0 0 0
## 5 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000 0.00000 0 0 0
## 6 5.13035 4.4356 2.284691 2.534584 2.943857 3.447698 12.50124 0 0 0
## though went best disappoint little magic plan restaurant servicable. think
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## week charactars. eat enough fantastci fun get money photo train want better
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0
## come holiday miss must save say start two florida still mania space spend
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## spent back earlier firework night young familiar made age help need look min
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## recommend wait pass half smaller definitaley book mickey ablaze california
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## comparable didnt first make next. nice sure year adult beauties buy new old
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## wonder land differ clean high trip end found hotel light bring everithing
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0
## although pirate thing use full right happier without alway long friend
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0
## pariah. part meet give watch return least anothe adventure cant
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
After adding the rating back, we split the document-term-matrix data frame with term-frequency weighting into training and testing datasets: the training dataset contains 70% of the rows and the testing dataset contains the rest.
set.seed(617)
split = sample(1:nrow(disneyland_data),size = 0.7*nrow(disneyland_data))
train = disneyland_data[split,]
test = disneyland_data[-split,]
First, we used a regression tree to predict rating from all other variables, that is, the term frequencies.
library(rpart)
#install.packages('rpart.plot')
library(rpart.plot)
tree = rpart(rating~.,train)
rpart.plot(tree)
We used the tree to predict ratings for the test sample and computed the root mean square error (RMSE). The RMSE is 0.9900591.
pred_tree = predict(tree,newdata=test)
rmse_tree = sqrt(mean((pred_tree - test$rating)^2)); rmse_tree
## [1] 0.9900591
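To put an RMSE such as 0.99 in context, it can help to compare it with a naive baseline that predicts the training mean rating for every test review. The sketch below is illustrative only: the `rmse` helper and the toy rating vectors are assumptions standing in for `train$rating` and `test$rating`, and the numbers are made up.

```r
# Hypothetical helper: root mean square error between predictions and truth.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Toy ratings standing in for train$rating and test$rating (made-up values).
train_rating <- c(5, 4, 5, 3, 5, 2, 4, 5)
test_rating  <- c(5, 5, 1, 4)

# Naive baseline: predict the training mean for every test review.
baseline_pred <- rep(mean(train_rating), length(test_rating))
rmse_baseline <- rmse(baseline_pred, test_rating)
rmse_baseline
```

With the report's own objects, the equivalent call would be `rmse(rep(mean(train$rating), nrow(test)), test$rating)`; a model is only useful to the extent that its RMSE falls below that reference value.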
Next, we used a linear regression to predict rating from all other variables, that is, the term frequencies.
reg = lm(rating~.,train)
summary(reg)
##
## Call:
## lm(formula = rating ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2217 -0.4488 0.2174 0.6066 4.8907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2467335 0.0089051 476.886 < 2e-16 ***
## busier 0.0347614 0.0150880 2.304 0.021235 *
## day 0.0333278 0.0052558 6.341 2.32e-10 ***
## disneyland 0.0105037 0.0054022 1.944 0.051867 .
## ever -0.1002808 0.0213443 -4.698 2.64e-06 ***
## feel -0.0340780 0.0156235 -2.181 0.029177 *
## find -0.0320150 0.0161639 -1.981 0.047641 *
## hong 0.0433834 0.0858173 0.506 0.613189
## kong -0.0819234 0.0858294 -0.954 0.339843
## main -0.0207194 0.0174152 -1.190 0.234163
## one -0.0209127 0.0068667 -3.046 0.002325 **
## queue -0.0689125 0.0076874 -8.964 < 2e-16 ***
## ride -0.0349307 0.0042200 -8.277 < 2e-16 ***
## small -0.0895776 0.0136658 -6.555 5.67e-11 ***
## street 0.0442219 0.0215938 2.048 0.040579 *
## visit 0.0172332 0.0067959 2.536 0.011224 *
## walk 0.0033093 0.0120249 0.275 0.783163
## well 0.0489403 0.0112812 4.338 1.44e-05 ***
## world 0.0447760 0.0115730 3.869 0.000110 ***
## worth 0.1062335 0.0128101 8.293 < 2e-16 ***
## also 0.0154785 0.0098140 1.577 0.114766
## area -0.0019845 0.0145306 -0.137 0.891366
## attract -0.0419941 0.0093605 -4.486 7.28e-06 ***
## bit 0.0714168 0.0157229 4.542 5.59e-06 ***
## disney -0.0440046 0.0050310 -8.747 < 2e-16 ***
## dont -0.0429540 0.0111326 -3.858 0.000114 ***
## especial 0.0201458 0.0189079 1.065 0.286672
## even -0.0390860 0.0098418 -3.971 7.16e-05 ***
## expect -0.0307576 0.0132084 -2.329 0.019885 *
## experiance -0.0390827 0.0102264 -3.822 0.000133 ***
## good -0.0083701 0.0086481 -0.968 0.333127
## got -0.0206250 0.0110294 -1.870 0.061493 .
## great 0.1237688 0.0078111 15.845 < 2e-16 ***
## just -0.0214620 0.0077731 -2.761 0.005765 **
## last -0.0977375 0.0163408 -5.981 2.24e-09 ***
## less -0.0198917 0.0186200 -1.068 0.285396
## like -0.0154008 0.0088788 -1.735 0.082832 .
## member -0.0085260 0.0152672 -0.558 0.576538
## mountain 0.0352843 0.0148378 2.378 0.017413 *
## now -0.0775180 0.0168848 -4.591 4.43e-06 ***
## open 0.0132875 0.0121574 1.093 0.274424
## park -0.0087809 0.0040322 -2.178 0.029435 *
## place 0.0113032 0.0080716 1.400 0.161418
## realli -0.0023677 0.0084598 -0.280 0.779572
## seem -0.0298034 0.0147240 -2.024 0.042966 *
## since 0.0146792 0.0184325 0.796 0.425821
## somethin -0.0032142 0.0196439 -0.164 0.870028
## staff -0.1624885 0.0110015 -14.770 < 2e-16 ***
## star 0.0273293 0.0177327 1.541 0.123285
## stay 0.0502498 0.0129720 3.874 0.000107 ***
## theme -0.0085686 0.0138255 -0.620 0.535413
## time 0.0345154 0.0053564 6.444 1.18e-10 ***
## whole -0.0313849 0.0175505 -1.788 0.073745 .
## amazaaaaah.. 0.2149915 0.0122545 17.544 < 2e-16 ***
## around 0.0130169 0.0108747 1.197 0.231321
## arrival -0.0106907 0.0188520 -0.567 0.570660
## big 0.0444973 0.0143201 3.107 0.001890 **
## castle -0.0311564 0.0178376 -1.747 0.080706 .
## close -0.1549234 0.0100884 -15.357 < 2e-16 ***
## enjoy 0.0671976 0.0095628 7.027 2.16e-12 ***
## everyone 0.1042187 0.0174031 5.989 2.14e-09 ***
## food -0.0231880 0.0089052 -2.604 0.009222 **
## hour -0.1605690 0.0102145 -15.720 < 2e-16 ***
## lot 0.0332516 0.0095216 3.492 0.000480 ***
## minut. -0.0457196 0.0108264 -4.223 2.42e-05 ***
## much 0.0044375 0.0098809 0.449 0.653365
## parad 0.0317380 0.0099334 3.195 0.001399 **
## quit -0.0147260 0.0164663 -0.894 0.371165
## shop -0.0102745 0.0154741 -0.664 0.506710
## way -0.0673021 0.0131706 -5.110 3.24e-07 ***
## will 0.0091372 0.0077702 1.176 0.239631
## can 0.0822794 0.0076619 10.739 < 2e-16 ***
## crowd -0.0982473 0.0098666 -9.958 < 2e-16 ***
## drink -0.0095182 0.0180349 -0.528 0.597666
## kid -0.0568329 0.0072239 -7.867 3.75e-15 ***
## love 0.1390220 0.0081052 17.152 < 2e-16 ***
## pay -0.1766237 0.0199262 -8.864 < 2e-16 ***
## price -0.1002167 0.0135440 -7.399 1.41e-13 ***
## work -0.0584159 0.0167784 -3.482 0.000499 ***
## everythig 0.0836214 0.0137833 6.067 1.32e-09 ***
## took 0.0008739 0.0164736 0.053 0.957692
## children -0.0684988 0.0105164 -6.514 7.47e-11 ***
## expense -0.0959482 0.0137374 -6.984 2.92e-12 ***
## fast -0.0056744 0.0126493 -0.449 0.653728
## however -0.0384122 0.0152817 -2.514 0.011956 *
## line -0.0373275 0.0074568 -5.006 5.60e-07 ***
## managable -0.0639526 0.0184304 -3.470 0.000521 ***
## never -0.0587260 0.0150032 -3.914 9.09e-05 ***
## peopl -0.1060211 0.0089751 -11.813 < 2e-16 ***
## see 0.0199983 0.0089320 2.239 0.025167 *
## show 0.0247205 0.0081679 3.027 0.002476 **
## take 0.0310479 0.0101796 3.050 0.002291 **
## ticket -0.0368143 0.0087390 -4.213 2.53e-05 ***
## tri -0.0575286 0.0142676 -4.032 5.54e-05 ***
## water 0.0112922 0.0167072 0.676 0.499116
## bad -0.0870567 0.0200245 -4.348 1.38e-05 ***
## daughter -0.0113881 0.0136518 -0.834 0.404185
## know 0.0093184 0.0165136 0.564 0.572563
## though 0.0868489 0.0156160 5.562 2.70e-08 ***
## went -0.0127040 0.0094198 -1.349 0.177459
## best 0.1357465 0.0130657 10.390 < 2e-16 ***
## disappoint -0.3420133 0.0147622 -23.168 < 2e-16 ***
## little 0.0145083 0.0117925 1.230 0.218595
## magic 0.0825491 0.0099256 8.317 < 2e-16 ***
## plan 0.0627160 0.0141147 4.443 8.89e-06 ***
## restaurant -0.0075581 0.0134842 -0.561 0.575132
## servicable. -0.1359655 0.0172513 -7.881 3.35e-15 ***
## think -0.0608032 0.0145617 -4.176 2.98e-05 ***
## week 0.0355589 0.0188379 1.888 0.059087 .
## charactars. -0.0214452 0.0104403 -2.054 0.039977 *
## eat 0.0093626 0.0170596 0.549 0.583134
## enough 0.0038871 0.0168709 0.230 0.817779
## fantastci 0.2072158 0.0190010 10.906 < 2e-16 ***
## fun 0.0776914 0.0101443 7.659 1.94e-14 ***
## get 0.0033358 0.0060001 0.556 0.578249
## money -0.3892819 0.0169312 -22.992 < 2e-16 ***
## photo 0.0191894 0.0145526 1.319 0.187306
## train 0.0431747 0.0131936 3.272 0.001068 **
## want -0.0109111 0.0113792 -0.959 0.337638
## better -0.0400282 0.0136792 -2.926 0.003434 **
## come -0.0068127 0.0137953 -0.494 0.621421
## holiday 0.0251624 0.0163428 1.540 0.123653
## miss 0.0505023 0.0164743 3.066 0.002175 **
## must 0.1231553 0.0157154 7.837 4.79e-15 ***
## save 0.0076759 0.0206132 0.372 0.709613
## say -0.0525152 0.0143651 -3.656 0.000257 ***
## start 0.0260744 0.0165219 1.578 0.114538
## two -0.0339374 0.0120251 -2.822 0.004773 **
## florida -0.0884158 0.0147722 -5.985 2.19e-09 ***
## still 0.0329640 0.0114018 2.891 0.003842 **
## mania -0.0883281 0.0106946 -8.259 < 2e-16 ***
## space 0.0015665 0.0201252 0.078 0.937958
## spend -0.0423408 0.0184742 -2.292 0.021919 *
## spent 0.0019915 0.0183301 0.109 0.913484
## back 0.0063244 0.0106343 0.595 0.552034
## earlier 0.0948149 0.0137389 6.901 5.27e-12 ***
## firework 0.0390859 0.0124611 3.137 0.001711 **
## night 0.0392948 0.0138091 2.846 0.004436 **
## young 0.0201330 0.0189283 1.064 0.287499
## familiar -0.0138351 0.0111688 -1.239 0.215456
## made 0.0020972 0.0167994 0.125 0.900654
## age 0.1007268 0.0167341 6.019 1.77e-09 ***
## help 0.0790047 0.0153683 5.141 2.75e-07 ***
## need -0.0388105 0.0121941 -3.183 0.001461 **
## look -0.0682634 0.0146244 -4.668 3.06e-06 ***
## min -0.0727035 0.0143345 -5.072 3.96e-07 ***
## recommend 0.0551549 0.0151730 3.635 0.000278 ***
## wait -0.0262877 0.0080805 -3.253 0.001142 **
## pass 0.0153463 0.0103876 1.477 0.139589
## half -0.0995098 0.0195573 -5.088 3.64e-07 ***
## smaller 0.0579425 0.0209182 2.770 0.005610 **
## definitaley 0.0740396 0.0169621 4.365 1.28e-05 ***
## book -0.0172191 0.0147060 -1.171 0.241655
## mickey -0.0221618 0.0148356 -1.494 0.135234
## ablaze 0.1097057 0.0165470 6.630 3.42e-11 ***
## california -0.0220489 0.0159785 -1.380 0.167624
## comparable -0.0504678 0.0212761 -2.372 0.017697 *
## didnt -0.0118424 0.0127607 -0.928 0.353399
## first 0.0441951 0.0109528 4.035 5.47e-05 ***
## make 0.0158674 0.0118069 1.344 0.178991
## next. -0.0033712 0.0191650 -0.176 0.860372
## nice -0.0015046 0.0149861 -0.100 0.920028
## sure 0.0452913 0.0155572 2.911 0.003602 **
## year 0.0044386 0.0097341 0.456 0.648403
## adult 0.0304081 0.0154970 1.962 0.049751 *
## beauties 0.1087420 0.0197987 5.492 4.00e-08 ***
## buy 0.0268342 0.0178634 1.502 0.133059
## new 0.0027366 0.0153195 0.179 0.858225
## old -0.0192292 0.0133613 -1.439 0.150115
## wonder 0.1532861 0.0174295 8.795 < 2e-16 ***
## land 0.0106474 0.0135621 0.785 0.432413
## differ 0.0551390 0.0154944 3.559 0.000373 ***
## clean 0.1156974 0.0178121 6.495 8.42e-11 ***
## high 0.0243933 0.0195334 1.249 0.211750
## trip 0.0232014 0.0123032 1.886 0.059332 .
## end -0.0016379 0.0164994 -0.099 0.920926
## found 0.0003821 0.0165908 0.023 0.981627
## hotel 0.0328390 0.0105309 3.118 0.001821 **
## light 0.0581660 0.0193320 3.009 0.002625 **
## bring 0.0486872 0.0176062 2.765 0.005690 **
## everithing 0.0969435 0.0128367 7.552 4.42e-14 ***
## although 0.0697053 0.0192175 3.627 0.000287 ***
## pirate -0.0044723 0.0211620 -0.211 0.832625
## thing -0.0183291 0.0120063 -1.527 0.126867
## use 0.0439529 0.0112816 3.896 9.80e-05 ***
## full -0.0220858 0.0200784 -1.100 0.271352
## right 0.1011022 0.0190383 5.310 1.10e-07 ***
## happier 0.0577608 0.0191657 3.014 0.002583 **
## without 0.0535065 0.0218150 2.453 0.014183 *
## alway 0.1245232 0.0129597 9.608 < 2e-16 ***
## long -0.0371159 0.0109593 -3.387 0.000708 ***
## friend 0.0978418 0.0152262 6.426 1.33e-10 ***
## pariah. -0.0877762 0.0109471 -8.018 1.11e-15 ***
## part -0.0658069 0.0204477 -3.218 0.001291 **
## meet 0.0491927 0.0158316 3.107 0.001890 **
## give -0.0029705 0.0218030 -0.136 0.891631
## watch 0.0214849 0.0175563 1.224 0.221049
## return -0.0427751 0.0190595 -2.244 0.024821 *
## least -0.0580124 0.0191180 -3.034 0.002412 **
## anothe -0.0334258 0.0188120 -1.777 0.075607 .
## adventure 0.0713293 0.0193814 3.680 0.000233 ***
## cant 0.0495708 0.0190164 2.607 0.009146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.875 on 27828 degrees of freedom
## Multiple R-squared: 0.3199, Adjusted R-squared: 0.315
## F-statistic: 65.12 on 201 and 27828 DF, p-value: < 2.2e-16
We used the linear regression to predict ratings for the test sample. The RMSE is 0.8705454.
pred_reg = predict(reg, newdata=test)
rmse_reg = sqrt(mean((pred_reg-test$rating)^2)); rmse_reg
## [1] 0.8705454
Next, we repeated the steps above on the dataset with TF-IDF weighting to predict rating.
set.seed(617)
split = sample(1:nrow(disneyland_data_tfidf),size = 0.7*nrow(disneyland_data_tfidf))
train = disneyland_data_tfidf[split,]
test = disneyland_data_tfidf[-split,]
library(rpart); library(rpart.plot)
tree1 = rpart(rating~.,train)
rpart.plot(tree1)
We applied the predictions of the tree to the test sample. The RMSE is 0.9900591.
pred_tree1 = predict(tree1,newdata=test)
rmse_tree1 = sqrt(mean((pred_tree1 - test$rating)^2)); rmse_tree1
## [1] 0.9900591
# The summary output matches the linear regression above, so it is not shown.
reg1 = lm(rating~.,train)
summary(reg1)
We applied the predictions of linear regression to the test sample. The RMSE is 0.8705454.
pred_reg1 = predict(reg1, newdata=test)
rmse_reg1 = sqrt(mean((pred_reg1 - test$rating)^2)); rmse_reg1
## [1] 0.8705454
The two tree models have the same RMSE on the test sample (0.9900591), and so do the two linear regression models (0.8705454). The linear regression models have a lower RMSE than the tree models. Furthermore, the two tree models have almost the same structure, and the two linear regressions give almost the same results. In the tree models, “money”, “disappoint”, and “hour” are highlighted as predictors of the rating, and words related to time, such as “hour”, “time”, and “minut.”, have significant coefficients in the linear regression models (p-values below 0.001, with “hour” below 2e-16), which also suggests that waiting time matters a great deal to visitors. The time words plot and the decision trees both indicate that a waiting time of about an hour is associated with a lower rating. “Food” is also a relatively significant variable in the linear regression models, with p-values below 0.05 in both models. This indicates that Disneyland should pay attention to food-related issues.
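A likely explanation for the identical RMSE values under both weightings is that the TF-IDF columns shown in `head(disneyland_data_tfidf)` are each term's counts multiplied by one positive constant (for example, the `disneyland` column contains 1.130732, 2.261463, and 5.653658, which are 1, 2, and 5 times the same value). A regression tree splits on the ordering of values, which such rescaling does not change, and ordinary least squares yields the same fitted values when a predictor is rescaled. The sketch below illustrates the least-squares case on made-up data; the term names, the `idf` weights, and the simulated ratings are all assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for the toy data

# Toy term-frequency matrix: 200 "reviews" by 3 "terms" (made-up counts).
n  <- 200
tf <- matrix(rpois(n * 3, lambda = 1), ncol = 3,
             dimnames = list(NULL, c("queue", "food", "staff")))
rating <- 4 - 0.3 * tf[, "queue"] + 0.2 * tf[, "food"] + rnorm(n, sd = 0.5)

# Hypothetical per-term IDF weights: one positive constant per column.
idf   <- c(queue = 2.55, food = 2.06, staff = 2.87)
tfidf <- sweep(tf, 2, idf, `*`)

d_tf    <- data.frame(rating = rating, tf)
d_tfidf <- data.frame(rating = rating, tfidf)

fit_tf    <- lm(rating ~ ., data = d_tf)
fit_tfidf <- lm(rating ~ ., data = d_tfidf)

# Fitted values agree; each TF slope equals the TF-IDF slope times that term's weight.
all.equal(unname(fitted(fit_tf)), unname(fitted(fit_tfidf)))
all.equal(coef(fit_tf)[-1], coef(fit_tfidf)[-1] * idf)
```

Under this assumption the slopes differ only by the per-term weight, while t-values, p-values, and predictions are unchanged, which would be consistent with the matching RMSE values reported for the two weightings.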
In conclusion, the analysis of the key factors that visitors care about during their visits, and of how they rate different Disneyland branches accordingly, provides valuable insights into visitors’ preferences, expectations, and behaviors. As Disneyland continues to expand its global presence, it must consider these factors to ensure that each location provides unique and authentic experiences that meet visitors’ expectations, leading to increased visitor satisfaction and loyalty and contributing to the overall success of Disneyland’s global brand. Ultimately, this analysis emphasizes the importance of prioritizing visitor needs and preferences and provides targeted recommendations to help Disneyland achieve long-term success and maintain its status as a premier entertainment destination.
First, we filtered out some common words that are unnecessary in our study by examining the top 25 words mentioned in the review text. Then, we explored the frequency of the top words mentioned by reviewers from different continents. We found that reviewers from different continents mention some of the same words, such as “rides”, “time”, and “kids”, indicating that all reviewers care about these topics. Other words are mentioned at different frequencies by reviewers from different continents; for example, “food” is among the top 10 mentioned words in reviews from Africa, Asia, Europe, and Oceania. For both the car adventure topic and the food-related topic, we first conducted frequency analysis and sentiment analysis using the afinn lexicon for the targeted reviewer groups. We found that reviewers from America and Canada mentioned car adventure topics slightly more often than reviewers from other areas. This might suggest that America and Canada reviewers tend to mention car adventure topics more than reviewers from other areas. Moreover, a positive average sentiment score for America and Canada reviews also suggests that these reviewers tend to be happy with their car adventure experiences. Asian reviewers mentioned food-related topics less often than reviewers from other areas, so reviewers from Asian countries do not mention food more frequently than reviewers from non-Asian countries. This implies that Asian visitors may care less about food than visitors from non-Asian countries do when they visit the Disney theme parks. However, a positive average sentiment score for Asian reviews suggests that even though Asian reviewers may not care about food as much as other reviewers do, they are overall satisfied with the food in the Disney theme parks they visited.
In the Hong Kong branch, 22.23% of the reviews mentioned shopping topics. In California, only 12.81% mentioned shopping topics, which is much lower than in Hong Kong. Surprisingly, 28.27% of the reviews of the Paris branch mentioned shopping-related topics, which is higher than for the Hong Kong branch. We are therefore able to reject our hypothesis that the Hong Kong branch received a similar proportion of shopping-related reviews. In the afinn sentiment analysis, we observed that in all three branches, reviews unrelated to the shopping experience contain more positive words and tend to have a more positive tone than reviews that mention shopping.
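Percentages like these can be computed by flagging each review that matches a set of topic keywords and averaging the flag within each branch. The sketch below shows one way to do that on made-up reviews; the keyword pattern, the example texts, and the California and Paris branch labels are assumptions, and the report's actual word list may differ.

```r
# Toy reviews (made-up text); the real data has Review_Text and Branch columns.
reviews <- data.frame(
  Branch = c("Disneyland_HongKong", "Disneyland_HongKong",
             "Disneyland_California", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_Paris"),
  Review_Text = c("Great shops on Main Street", "Rides were fun",
                  "Loved the parade", "Long queue for Space Mountain",
                  "The souvenir store was pricey", "Bought gifts in the shop"),
  stringsAsFactors = FALSE
)

# Hypothetical keyword pattern for "shopping"; the report's word list may differ.
shopping_pattern <- "shop|store|souvenir|gift"

reviews$mentions_shopping <- grepl(shopping_pattern, reviews$Review_Text,
                                   ignore.case = TRUE)

# Share of reviews per branch that mention shopping, in percent.
share <- tapply(reviews$mentions_shopping, reviews$Branch, mean) * 100
round(share, 2)
```

A simple substring pattern also matches longer words that contain a keyword, so a real analysis may prefer tokenized text or word-boundary patterns.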
Similar proportions of reviews mentioned ride experiences for the California and Paris branches. The Hong Kong branch has the lowest proportion of reviews mentioning ride experiences. In Hong Kong, the average rating is higher for reviews that mentioned rides than for those that did not. However, although the California branch tends to receive the highest ratings among all three branches, its average rating is higher for reviews that did not mention rides than for those that did. Paris tends to receive much lower average ratings than the other two branches, regardless of whether rides are mentioned. This might suggest that visitors are not as satisfied with the Paris branch in general as with the other branches, and that ride experiences might be related to such low ratings. As for the sentiment analysis for this hypothesis, we noticed that the difference in the proportion of positive words between reviews with and without ride experiences is ambiguous, but the average sentiment score for ride-related reviews is much lower, indicating that reviewers may have more extreme emotions toward ride experiences.
We classified topic words into four categories: time, theme rides, dining, and customer services, and we wanted to see whether reviews mentioning these topics would affect the overall rating. From our analysis, we can conclude the following. For reviews mentioning food, we reject the null hypothesis: among low-rating reviews, we observed many discussing food quality. For time words, we reject the null hypothesis: specifically, when a review mentions a one-hour wait time, the overall rating tends to be lower. For most of the rides’ features, we fail to reject the null hypothesis, meaning that in reviews mentioning specific rides, the features associated with the ride do not affect the review’s overall rating. However, we did find a special case for rides with drops. Lastly, for reviews mentioning staff, we reject the null hypothesis: the proportion of reviews that mention staff is larger in lower rating categories than in higher rating categories.
Based on results found by previous research questions, recommendations for Disneyland to improve and develop can be divided into two aspects: opening new branches to attract more visitors and improving existing branches to increase overall satisfaction/experience.
As for improving existing parks, recommendations are provided based
on four sections: food, time, ride, and staff.
Food:
1. Disneyland should address the issues of overpricing, limited and
unclean dining options, poor quality/unhealthy food, and long waiting
times. These problems have been highlighted in many negative reviews and
are supported by the predictive model’s analysis. To improve, Disneyland
can consider building more restaurants and offering online ordering to
reduce waiting times.
2. Disneyland could consider expanding and
diversifying its food options, particularly in non-Asian parks where
food is mentioned more frequently. They could also focus on improving
the quality and taste of their current food options to increase overall
visitor satisfaction.
Time:
1. Visitors have expressed dissatisfaction with waiting times of more
than an hour, which has led to lower ratings. Disneyland can try to
reduce waiting times by improving queue management, offering fast pass
or similar systems, or increasing the number of staff during peak
periods.
Ride:
1. Most ride-related factors were found to have no significant impact on
overall ratings, except for the number of drops. High-rated reviews
mentioned more non-drop attractions. Therefore, Disneyland could
consider adding non-drop attractions to improve overall visitor
satisfaction.
2. Disneyland could focus on expanding and diversifying its ride
offerings to attract a wider range of visitors, taking into account the
different preferences of visitors from different countries. They could also
prioritize the maintenance and upkeep of their current rides to ensure a
high-quality experience for visitors.
Staff:
1. Visitors have complained about employees not doing their job, having
a poor attitude, and not speaking English in the case of Hong Kong
Disneyland. Disneyland should address these issues by providing employee
training in language skills, customer service, and direction to improve
the overall visitor experience.
First, our dataset contains only “review_id”, which is unique for each review; a “reviewer_id” identifying each unique user is not included. As a result, we were not able to examine the reviews that one person wrote for different Disneyland park locations, and we were also unable to rule out the possibility of one reviewer leaving multiple reviews for one Disneyland park location. Second, our analysis only focuses on reviews of the Disneyland parks located in Paris, Hong Kong, and California; Disneyland parks in other cities and countries are not included, so the generalizability of our study may be limited. Third, there might be errors in the process of matching the ride names with the review texts: it is possible that some of the Disney character names were falsely matched with a ride name. Additionally, our study was not able to identify reviews with spelling errors and match them with the correct ride names. Therefore, our analysis may contain a small portion of inaccurate data and may miss some accurate data. Finally, our study did not rule out the impact of visiting time, seasonality, and year on reviews and ratings. Since Disneyland parks have peak and off seasons, the time of visit is a confounding variable in our study. In addition, the text analysis could not detect sentiments such as irony, so the sentiment analysis is limited by the narrow range of sentiments it can capture.
Based on the sentiment analysis conducted on the Disneyland review dataset, several suggestions for future studies can be made. First, future studies could select more topics and investigate their impact on visitor ratings. For instance, topics such as kids’ and children’s experiences could reveal insights into how Disneyland parks create childhood memories and shape family relationships. Similarly, the interaction between travelers and animated characters could shed light on the immersive experience. Fireworks are another core component of the Disneyland experience, so further analysis could focus on this element and explore Disneyland’s nighttime experience. Second, the tree models used in this study identified “money” as a significant variable; hence, future studies could further investigate the financial factors that may strongly influence visitors’ perception of the park. Such an investigation could focus on the costs of tickets, food, and merchandise, as well as expenses outside the park such as nearby hotels and transportation. It could also examine how visitors feel about this pricing and whether they feel the expenses are worth it. Third, this study did not take the timing of visits into account. Follow-up studies could divide the data into off-season and peak-season visits and compare visitor ratings between the two to explore the impact of crowds and wait times on visitor experience, which would make the research more comprehensive. Finally, it is important to note that the data come from only three branches of Disneyland. Including data from other branches or Disney parks worldwide could provide a more diverse and comprehensive understanding of the factors that impact visitor ratings.
Additionally, it could help identify more regional differences in visitor perceptions of Disney parks and provide valuable insights into how the park can better cater to the needs and preferences of visitors from different parts of the world.
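As a starting point for the peak-season comparison suggested above, the `Year_Month` column (stored as text such as "2019-4") could be split into year and month and mapped to a season label. The sketch below is only an illustration: the set of peak months is a hypothetical choice, not something established in this report, and the input values are toy examples in the same format.

```r
# Toy values in the "YYYY-M" format shown in the data summary.
year_month <- c("2019-4", "2019-5", "2018-12", "2018-7")

# Split each value on "-" and pull out the year and month as integers.
parts <- strsplit(year_month, "-", fixed = TRUE)
year  <- as.integer(vapply(parts, `[`, character(1), 1))
month <- as.integer(vapply(parts, `[`, character(1), 2))

# Hypothetical peak months (summer and December); an assumption, not from the report.
peak_months <- c(6, 7, 8, 12)
season <- ifelse(month %in% peak_months, "peak", "off")

data.frame(year_month, year, month, season)
```

Once a season label is attached to each review, average ratings could be compared between the two groups, for example with `aggregate(rating ~ season, data = ..., FUN = mean)`.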
Luo, J., Li, G., Li, G., & Law, R. (2020). Topic modelling for theme park online reviews: analysis of Disneyland. Journal of Travel & Tourism Marketing, 37(2), 272–285. https://doi.org/10.1080/10548408.2020.1740138
Disneyland Reviews. (2021, January 19). Kaggle. https://www.kaggle.com/datasets/arushchillar/disneyland-reviews
Walt Disney World Ride Data - dataset by lynne588. (2023, March 15). Data.world. https://data.world/lynne588/walt-disney-world-ride-data